Fix race condition causing 'ack timeout 4' disconnects #4

Merged
merged 1 commit on Apr 11, 2021

Conversation

@mmakaay commented Apr 11, 2021

A pull request, as requested by @OttoWinter in the esphome/issues#1921 (comment)

The AsyncClient::send() method sets a boolean to true after pushing
data over the TCP socket successfully using tcp_output(). It also sets
a timestamp to remember at what time the data was sent.

The AsyncClient::_sent() callback method reacts to ACKs coming from
the connected client. This method sets the boolean to false.

In the AsyncClient::_poll() method, a check is done to see if the
boolean is true ("I'm waiting for an ACK") and if the time at which
the data was sent is too long ago (5000 ms). If this is the case,
a connection issue with the connected client is assumed and the
connection is forcibly closed by the server.

The race condition occurs when these operations get mixed up because of
multithreading behavior. The _sent() method can be called during
the execution of the send() method:

  1. send() sends out data using tcp_output()
  2. _sent() is called because an ACK is processed, sets boolean to false
  3. send() continues and sets boolean to true + timestamp to "now"

After this, the data exchange with the client was successful. Data were
sent and the ACK was seen.
However, the boolean ended up as true, making the _poll() method think
that an ACK is still to be expected. As a result, 5000 ms later, the
connection is dropped.

This commit fixes the code by first registering that an ACK is
expected, before calling tcp_output(). This way, there is no race
condition when the ACK is processed right after that call.

Additionally, I changed the boolean to an integer counter value.
The server might send multiple messages to the client, resulting in
multiple expected ACKs. A boolean does not cover this situation.
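For illustration, a minimal sketch of the reordering described above (the counter name `_tx_unacked` is made up here; it stands in for the original `_pcb_busy` flag and the new counter, whose real name may differ):

bool AsyncClient::send() {
    // Record that an ACK is expected *before* handing the data to lwIP, so a
    // _sent() callback that fires immediately cannot be overwritten afterwards.
    _tx_unacked++;                 // illustrative counter, replaces the old boolean
    _pcb_sent_at = millis();
    if (_tcp_output(_pcb, _closed_slot) == ERR_OK) {
        return true;
    }
    // Sending failed, so no ACK will arrive for this attempt after all.
    _tx_unacked--;
    return false;
}

int8_t AsyncClient::_sent(tcp_pcb* pcb, uint16_t len) {
    if (_tx_unacked > 0) {
        _tx_unacked--;             // simplified: real ACKs may be coalesced by lwIP
    }
    _rx_last_packet = millis();
    // ... rest of the original callback
    return ERR_OK;
}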

@OttoWinter (Owner) left a comment

Thanks! 🎉

@Misiu commented Apr 15, 2021

@mmakaay what about ESPAsyncTCP-esphome?
AsyncTCP is used on ESP32, but ESPAsyncTCP is used on ESP8266 (ref: https://github.com/esphome/esphome/pull/1693/files#diff-1988fae5ec01f4c733d17749246f0facf70a29738259590d1bdf9225ad10cd3dR10-R15), which many of us still use.
I'm not sure if this error occurs there (I don't have the knowledge to tell), maybe not, but I thought it was worth asking :)

@mmakaay (Author) commented Apr 15, 2021

I have looked at ESPAsyncTCP while I was tracking this issue, and it does use the same construction to keep track of ACKs.
I concluded that it might therefore suffer from the same kind of disconnection issues (does it?).

However, a lot has been done to that code, and I saw some very specific comments about possible crashes and how to prevent them. The two libraries looked so different in places that I could not decide right away whether the same fix was needed.
If this type of disconnect error is occurring for ESP8266 too, then I'm pretty sure that the same fix can be applied.

Somewhere in the back of my mind, I made a mental note about checking out the actual differences between these two libraries, because when looking at them, I wondered if the ESP32 AsyncTCP library wouldn't benefit from some of the developments in the ESPAsyncTCP library.

@mmakaay (Author) commented Apr 22, 2021

Hey @OttoWinter, based on the revert of the AsyncTCP version bump by @glmnet: do you agree it would be best to revert the patch from this pull request? It's no good if there are devices that show a regression in the form of extra disconnects.

@OttoWinter (Owner) commented

@mmakaay Agree, though it would make sense to understand how that regression could be caused too. Maybe we're encountering another race condition now because some code is executed on another ESP32 core in parallel (or interleaved execution), so it might make sense to use mutexes in this code.
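For illustration, serializing access to the ACK bookkeeping on the ESP32 could look roughly like this (a sketch using a FreeRTOS critical section rather than a full mutex; the spinlock name and locking granularity are made up, not the library's actual code):

#include "freertos/FreeRTOS.h"

// Hypothetical spinlock guarding _pcb_busy / _pcb_sent_at against the lwIP
// task updating them from the other core (or with interleaved execution).
static portMUX_TYPE ack_state_mux = portMUX_INITIALIZER_UNLOCKED;

bool AsyncClient::send() {
    portENTER_CRITICAL(&ack_state_mux);
    _pcb_busy = true;
    _pcb_sent_at = millis();
    portEXIT_CRITICAL(&ack_state_mux);
    if (_tcp_output(_pcb, _closed_slot) == ERR_OK) {
        return true;
    }
    portENTER_CRITICAL(&ack_state_mux);
    _pcb_busy = false;
    portEXIT_CRITICAL(&ack_state_mux);
    return false;
}

int8_t AsyncClient::_sent(tcp_pcb* pcb, uint16_t len) {
    portENTER_CRITICAL(&ack_state_mux);
    _pcb_busy = false;
    portEXIT_CRITICAL(&ack_state_mux);
    // ... rest of the original callback
    return ERR_OK;
}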

@mmakaay (Author) commented May 3, 2021

Yes, I think that mutexes might improve this.
I have tried the code on all of my devices and for me it improved their API connections, so I wasn't able to trigger the regression myself. What would be interesting to know is what exact messages are logged on the console at the time of the disconnects, since that would tell us whether the disconnects were related to the ACK handling.

What I didn't look into further so far is how this setup would work when actual ACKs go missing. Where are the retransmits for this arranged? Should lwIP handle these? The AsyncTCP ACK handling assumes a sunny-day scenario in which no ACKs get lost, and treats a lost ACK as a reason to kick the connection. That is not the best scenario to work with. Possibly the regression hits when actual packets are getting lost on the network? It shouldn't kill the connection in such a case though, since the whole reason for TCP is to allow for such issues. From a TCP perspective, the connection might be considered sub-optimal, but it is very much usable nonetheless.

I can dive a bit deeper into the stack and see what lwIP actually does. If that layer does handle ACKs and retransmits, I'm not so sure that ACK handling even ought to be in AsyncTCP to begin with.

@OttoWinter (Owner) commented

Yes, lwIP does handle retransmits :)

IIRC the reason for it is cleanup purposes: with the ACK logic you know when it's safe to write to a buffer again (if the buffer wasn't copied to lwIP's internal one before). But I'm not 100% sure about that anymore.

@glmnet commented May 5, 2021

I managed to reproduce this issue by introducing a delay in the send method:

bool AsyncClient::send()
{
    int8_t err = ERR_OK;
    err = _tcp_output(_pcb, _closed_slot);
    if (err == ERR_OK)
    {
        vTaskDelay(50 / portTICK_RATE_MS);
        // The 50 is to give the network plenty of time to send the ack,
        // I tried with 25 and there were no timeouts
        _pcb_busy = true;
        _pcb_sent_at = millis();
        return true;
    }
    return false;
}

And fixed it with this (based on @mmakaay's ideas):

bool AsyncClient::send()
{
    int8_t err = ERR_OK;
    auto pcb_sent_at_back = _pcb_sent_at;
    auto pcb_busy_back = _pcb_busy;
    _pcb_sent_at = millis();
    _pcb_busy = true;
    err = _tcp_output(_pcb, _closed_slot);
    if (err == ERR_OK)
    {
        return true;
    }
    _pcb_busy = pcb_busy_back;
    _pcb_sent_at = pcb_sent_at_back;
    return false;
}

I also marked _pcb_busy and _pcb_sent_at as volatile:

  volatile bool _pcb_busy;
  volatile uint32_t _pcb_sent_at;

_pcb_busy is there just to tell us that _pcb_sent_at has a valid value and that the ACK timeout should be observed, so there is no harm at all; both values are backed up in case tcp_output fails.
If a timeout is already pending and tcp_output fails, the timeout will be "delayed", but the values will be reverted, so the timeout will still happen even if send is called again quickly.

@mmakaay (Author) commented May 5, 2021

Was making them volatile required to make it work correctly?

Last night, I came up with a slightly different and possibly cleaner fix. I will explain my code below.

Before I explain the code, here is a consideration that I included in this change:
One thing I found weird is that the ACK check presumably checks for ACKs, but it uses _rx_last_packet for that check. That field is updated on both incoming data packets and incoming ACKs. Thus, we might still be waiting for an ACK, but an incoming data packet would clear the timeout.
Likewise, the RX timeout would be cleared by an incoming ACK; however, that might be fine (when the check implements the use case "check if we see any traffic coming in from the client").

On to the code changes:

  • Removed _pcb_busy, because it is not needed for the timeout logic.
  • Renamed _pcb_sent_at to _tx_last_packet. That naming is more in line with the existing _rx_last_packet.
  • Introduced a field _rx_last_ack, to let the ACK timeout check look at ACKs instead of at "any incoming packet", as per my consideration above.

This makes the block of related fields from AsyncTCP.h:

    bool _ack_pcb;
    uint32_t _tx_last_packet;
    uint32_t _rx_ack_len;
    uint32_t _rx_last_packet;
    uint32_t _rx_since_timeout;
    uint32_t _rx_last_ack;
    uint32_t _ack_timeout;
    uint16_t _connect_port;

A few side-remarks:

  • for fun, check the use of _ack_pcb. It is always true, and there's one conditional that will always fail because of that. This looks like code that needs to be cleaned up.
  • _rx_since_timeout could be renamed to _rx_timeout. The "since" bit is confusing to me.

I removed all uses of _pcb_busy.
Next to that, the code from AsyncTCP.cpp that I changed is:

A much simplified send method, which now registers the time of the last transmitted packet when a send succeeds:

bool AsyncClient::send(){
    auto backup = _tx_last_packet;
    _tx_last_packet = millis();
    if (_tcp_output(_pcb, _closed_slot) == ERR_OK) {
        return true;
    }
    _tx_last_packet = backup;
    return false;
}

The method that handles an incoming ACK now updates _rx_last_ack:

int8_t AsyncClient::_sent(tcp_pcb* pcb, uint16_t len) {
    _rx_last_packet = millis();
    _rx_last_ack = millis();
    // ... snip
}

The logic for the ACK timeouts now checks whether the last incoming ACK time is before the last sent packet time. That eliminates the need for a boolean to enable the ACK timeout check.

int8_t AsyncClient::_poll(tcp_pcb* pcb){
    // ... snip
    // ACK Timeout
    if(_ack_timeout && _tx_last_packet > _rx_last_ack && (now - _tx_last_packet) >= _ack_timeout){
        log_w("ack timeout %d", pcb->state);
        if(_timeout_cb)
            _timeout_cb(_timeout_cb_arg, this, (now - _tx_last_packet));
        return ERR_OK;
    }
    // ... snip
}

This code is giving me very stable connections. I have 8 OTA loggers and HA active, and they now run as stably for me as when I fully disable the ACK timeout checks.

Do you want me to bundle this up as a new pull request?

@glmnet commented May 5, 2021

The volatile was added just by instinct.

Regarding your code, you should check your timestamp comparison. Otherwise it looks good to me.

_tx_last_packet > _rx_last_ack

should check for rollovers, see https://arduino.stackexchange.com/questions/12587/how-can-i-handle-the-millis-rollover
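
For illustration, the unsigned-subtraction idiom from that answer could replace the direct comparison like this (a sketch against the field names above, not the code that was eventually merged):

// Unsigned subtraction gives the correct ordering/elapsed time across a
// millis() rollover, as long as the two events are less than ~24.8 days apart.
bool last_tx_is_after_last_ack =
    (uint32_t)(_tx_last_packet - _rx_last_ack) < 0x80000000UL;  // half the uint32_t range
if (_ack_timeout && last_tx_is_after_last_ack && (now - _tx_last_packet) >= _ack_timeout) {
    log_w("ack timeout %d", pcb->state);
    // ... handle the timeout as before
}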

@mmakaay (Author) commented May 5, 2021

Good point. I was wondering if it would be worth it, but let's prevent that possible once-every-50-days disconnect issue 👍
How about this?

    if(_ack_timeout){
        uint32_t one_day = 86400000;
        bool last_tx_is_after_last_ack = (_rx_last_ack - _tx_last_packet + one_day) < one_day;
        if(last_tx_is_after_last_ack && (now - _tx_last_packet) >= _ack_timeout) {
            log_w("ack timeout %d", pcb->state);
            if(_timeout_cb)
                _timeout_cb(_timeout_cb_arg, this, (now - _tx_last_packet));
            return ERR_OK;
        }
    }

@mmakaay (Author) commented May 5, 2021

Okay, a follow-up pull request was created at #5.

I did not include the volatile qualifiers, because I do not think they would be useful.
Reading up on the combination of volatile and threading behavior in this Stack Overflow post didn't convince me that they would be.
If you do want to have those variables volatile anyway, let me know in the new PR and I can update it.

BlueAndi pushed a commit to BlueAndi/AsyncTCP that referenced this pull request Jun 10, 2021
After my first attempt at fixing the client disconnects
(OttoWinter#4) got merged
into AsyncTCP, it turned out that there was a regression
for some devices: the connection stability actually went
down instead of up.

After a lot of debugging and discussion with @glmnet (some of
the results can be found in the above pull request discussion),
we came up with an improved fix for the disconnect issues.

**Changed:**

The code that checks for ACK timeouts has been simplified in
such a way that only two timestamps are now used to determine if
an ACK timeout has happened: the time of the last sent packet
(this was already recorded), and the time of the last received
ACK from the client (this has been added).

Using these timestamps, there is no more need for a separate field
to keep track if we are waiting for an ACK or not (`_pcb_busy`).
Therefore, this field was completely removed from the code.

While I was at it, I renamed a few variables to make the code
easier to read and more consistent.

**Results:**

I connected Home Assistant plus 8 OTA loggers at the same time,
using very verbose logging output. This normally was an easy way
to trigger the disconnect errors.

It turned out that this solution runs as solidly for me as when disabling
the ACK timeout checks completely
(using `AsyncClient::setAckTimeout(0)`).

(cherry picked from commit 6f37967)
HeMan pushed a commit to HeMan/AsyncTCP that referenced this pull request Dec 25, 2023